1 Overview & Motivation

Ever since COVID-19 erupted into our world, research institutes and governments have released many public datasets so that research groups and independent individuals can analyze the spread of the coronavirus. We are facing an unprecedented public health crisis with the COVID-19 outbreak. We believe that data-driven decisions, and people working together for the greater good, are among the better ways to get through this difficult time.

In this blog, we want to know: how is the world’s news media covering the COVID-19 pandemic? Building on its massive television news narratives dataset, GDELT released a news dataset of the URLs, titles, publication dates and brief snippets of more than 1.1 million worldwide English-language online news articles mentioning the virus, to enable researchers and journalists to understand the global context of how the outbreak has been covered since November 2019. This dataset has been expanding daily and includes a number of related topics.

A single article on COVID-19 can cover various topics, such as health, the business implications of the disease, or climate change, or it could just be a front for spreading false information. Given the huge number of news articles circulating on the web in the wake of COVID-19, it is very difficult to compile and compare them. To analyze what is being discussed during these difficult times, we would first have to collect all the news articles and then annotate them according to their implicit news sub-categories. This motivates us to build an approach that annotates news articles on the coronavirus without any manual intervention. With such a pipeline we aim not only to give researchers, media professionals and journalists access to similar articles, but also to spare them the time spent reading and understanding unrelated articles. We thus aim to improve the quality of the groups of similar articles and of the topics that represent them.

We intend to address “information overload”, the huge flow of information that makes it harder for users to find related information on COVID-19 on the internet. We address it with an application that lets users find news matching their query or interest with little effort. We foresee some challenges, including determining the subtopic, extracting only the content of each webpage, and presenting the data to the user. Multi-label classification (MLC), in which an object can carry more than one label, has a lot of utility in real-world applications. However, labeling the dataset manually is costly and tedious. We therefore consider an unsupervised approach: cluster similar articles and then apply topic modelling to assign multiple labels to the clusters. We use an unsupervised learning technique (clustering) to group a collection of articles so that articles in the same group are more similar to each other than to those in other groups. Clustering can also help characterize the structure that is discovered.

We are trying to analyze the large set of news articles to make it easier for ordinary readers to filter through the many articles related to the virus and reach their own conclusions. Furthermore, we want to understand the semantic relations between different topics. Finally, we analyze keywords to uncover patterns in the news content.

3 Research Questions

Can we find articles with topics similar to those of a given article?
In order to answer this question, we need to answer the following research questions:
1. What is the most dominant topic in the article?
2. Which value of K (the number of topics) is best suited for topic modelling on our dataset?
3. How does the topic model perform with different features, namely Bag of Words (BoW) with term frequency–inverse document frequency (TF-IDF) weighting versus Bag of Words with plain term frequency weighting?
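
To make the two weighting schemes in question 3 concrete, here is a minimal base-R sketch on a toy corpus. The three snippets, the whitespace tokeniser and the variable names (`docs`, `bow`, `tf`, `idf`, `tfidf`) are illustrative assumptions, not the preprocessing used in this project.

```r
# Toy corpus (hypothetical snippets, not taken from the GDELT data)
docs <- c(
  d1 = "virus spread hospital virus",
  d2 = "market stock virus",
  d3 = "hospital vaccine trial"
)

# Tokenise on single spaces and build a sorted vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Bag of Words: raw term counts, one row per document
bow <- t(vapply(
  tokens,
  function(tok) as.numeric(table(factor(tok, levels = vocab))),
  numeric(length(vocab))
))
colnames(bow) <- vocab

# Term frequency: counts normalised by document length
tf <- bow / rowSums(bow)

# Inverse document frequency: log(N / number of documents containing the term)
idf <- log(nrow(bow) / colSums(bow > 0))

# TF-IDF: scale each term column of tf by that term's idf
tfidf <- sweep(tf, 2, idf, `*`)

print(round(tf, 3))
print(round(tfidf, 3))
```

In this toy example "virus" is the highest-weighted term of `d1` under plain term frequency, while TF-IDF ranks "spread" higher because "virus" also occurs in another document.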

4 Dataset

5 Exploratory Analysis

5.1 Overview of the Dataset with plots :

5.1.1 Distribution of Articles :

library(ggplot2)

# Bar chart of the number of articles per original label
ggplot(main_df) +
  aes(x = original_label) +
  geom_bar(position = "dodge", fill = "#4292c6") +
  theme_linedraw()
...

5.1.2 Wordcloud :

Using Bag of Words Model with Term Frequency Weighting scheme.

library(wordcloud)
library(RColorBrewer)

wordcloud(words = d_bow$word, freq = d_bow$freq, min.freq = 1,
          max.words = 100, scale = c(4, 0.2), random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
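
The construction of `d_bow` is not shown in this post. As a rough sketch only, a word/frequency table of the shape that `wordcloud()` consumes could be built in base R as follows. The sample texts, the tiny stop-word list and the name `d_bow_sketch` are assumptions made for illustration.

```r
# Hypothetical article snippets standing in for the real corpus
texts <- c("Virus cases rise as virus spreads",
           "Markets fall as virus cases rise")

# Lower-case, strip everything except letters and spaces, split on whitespace
words <- unlist(strsplit(gsub("[^a-z ]", "", tolower(texts)), " +"))

# Drop a tiny illustrative stop-word list
words <- words[!words %in% c("as", "the", "a", "of")]

# Count occurrences and sort by decreasing frequency
freq <- sort(table(words), decreasing = TRUE)

# Two columns, word and freq, matching the arguments passed to wordcloud()
d_bow_sketch <- data.frame(word = names(freq),
                           freq = as.integer(freq),
                           stringsAsFactors = FALSE)
print(d_bow_sketch)
```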

5.1.3 Wordcloud :

Using Bag of Words Model with TF-IDF Weighting scheme.

wordcloud(words = d_tfidf$word, freq = d_tfidf$freq, scale = c(2, 0.1), min.freq = 1,
          max.words = 100, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

6 Topic Modelling using LDA for Visualization :

6.1 Dimensionality Reduction Using t-SNE :

6.3 Clustering :

6.3.1 Evaluation :

6.3.1.1 Silhouette Coefficient :
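
For reference, the silhouette of a point i is s(i) = (b - a) / max(a, b), where a is the mean distance from i to the other points of its own cluster and b is the smallest mean distance from i to the points of any other cluster. The base-R sketch below computes the mean silhouette for several values of k. The synthetic 2-D data, the use of `kmeans()`, the candidate values of k and the helper name `mean_silhouette` are assumptions for illustration, not the clustering pipeline used here.

```r
set.seed(42)

# Hypothetical 2-D embedding: three well-separated blobs of 30 points each
centres <- rbind(c(0, 0), c(6, 0), c(0, 6))
emb <- centres[rep(1:3, each = 30), ] + matrix(rnorm(180, sd = 0.5), ncol = 2)

# Mean silhouette width for a given cluster assignment
mean_silhouette <- function(x, cluster) {
  d   <- as.matrix(dist(x))          # pairwise Euclidean distances
  ids <- sort(unique(cluster))
  s <- vapply(seq_len(nrow(x)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)     # singleton clusters score 0 by convention
    # a: mean distance to the other members of the own cluster (d[i, i] is 0)
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster
    b <- min(vapply(setdiff(ids, cluster[i]),
                    function(k) mean(d[i, cluster == k]),
                    numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

# Compare a few candidate numbers of clusters
candidate_k <- c(2, 3, 8)
scores <- vapply(candidate_k, function(k) {
  km <- kmeans(emb, centers = k, nstart = 10)
  mean_silhouette(emb, km$cluster)
}, numeric(1))
names(scores) <- candidate_k

print(round(scores, 3))
```

A higher mean silhouette indicates tighter, better-separated clusters, so the k with the largest score is the natural candidate. On this synthetic data that should be k = 3, the number of blobs that were generated.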

6.3.1.1.1 8 clusters :
...

6.3.1.1.2 15 clusters :
...

6.3.2 Convex Hull Plot 1 :

...

6.3.3 Convex Hull Plot 2 :

...

6.3.4 Sankey Network Diagram :

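library(networkD3)
library(magrittr)  # provides %>%

# One link per (topic, term) pair, weighted by beta
links <- data.frame(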
  source = top_terms$topic,
  target = top_terms$term,
  value = top_terms$beta
)

nodes <- data.frame(
  name=c(as.character(links$source),
         as.character(links$target)) %>% unique()
)

# networkD3 references nodes by zero-based index rather than by name,
# so map each source/target name to its position in `nodes`.
links$IDsource <- match(links$source, nodes$name) - 1
links$IDtarget <- match(links$target, nodes$name) - 1

# Build the Sankey diagram
p <- sankeyNetwork(Links = links, Nodes = nodes,
                   Source = "IDsource", Target = "IDtarget",
                   Value = "value", NodeID = "name",
                   colourScale = JS("d3.scaleOrdinal(d3.schemeCategory20);"),
                   sinksRight=FALSE,fontSize = 16,height = 1400,width = 1200,
                   nodePadding = 8, fontFamily = "arial",unit = "Letter(s)")
p

6.3.5 Chord Diagram :

library(circlize)

chordDiagram(new_v, big.gap = 10, directional = 1,
             direction.type = c("diffHeight", "arrows"),
             link.arr.type = "big.arrow", diffHeight = -mm_h(1),
             grid.col = c("violet", "blue4", "blue", "green", "yellow", "tomato", "red",
                          "cyan4", "deeppink", "cyan3", "chocolate4", "darkslategrey",
                          "darksalmon", "chartreuse", "darkorchid2", "deepskyblue1",
                          "lightcoral", "palegreen4", "paleturquoise2", "palevioletred",
                          "peru", "pink4", "purple2", "sienna1", "skyblue2", "seagreen2",
                          "rosybrown", "plum3", "slateblue2", "orange3", "darkgoldenrod2",
                          "salmon2", "pink2"))
...

6.3.6 Probability Distribution of Topics in each Cluster :

7 Topic Modelling using LDA for Prediction :

7.1 Bag of Words with Term Frequency as Weighting Scheme :

7.1.1 PC plot for Gibbs Sampling as Model 1 :

...

7.1.2 PC plot for Dot Product as Model 2 :

...

7.2 Bag of Words with TF-IDF as Weighting Scheme :

7.2.1 PC plot for Gibbs Sampling as Model 1 :

...

7.2.2 PC plot for Dot Product as Model 2 :

...

7.3 Model Evaluation :

The following metrics were used to evaluate the models.

7.3.1 Perplexity :

...

7.3.2 Likelihood :

...
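
Both metrics above are closely related: perplexity is the exponential of the negative log-likelihood per word, so a lower perplexity corresponds to a higher likelihood. The sketch below shows the arithmetic for a point-estimate topic model in base R. The matrices `theta`, `beta` and `counts` are made-up toy values, not results from this project.

```r
# Hypothetical fitted quantities for a 2-topic model over a 4-word vocabulary
theta <- rbind(d1 = c(0.9, 0.1),               # document-topic probabilities
               d2 = c(0.2, 0.8))
beta  <- rbind(t1 = c(0.5, 0.3, 0.1, 0.1),     # topic-word probabilities
               t2 = c(0.1, 0.1, 0.4, 0.4))
colnames(beta) <- c("virus", "hospital", "market", "stock")

# Hypothetical held-out word counts, one row per document
counts <- rbind(d1 = c(3, 2, 0, 1),
                d2 = c(0, 1, 4, 2))
colnames(counts) <- colnames(beta)

# p(word | document) = sum over topics of theta[d, k] * beta[k, w]
word_prob <- theta %*% beta

# Log-likelihood of the held-out counts under the model
log_lik <- sum(counts * log(word_prob))

# Perplexity: exp of the negative log-likelihood per word
perplexity <- exp(-log_lik / sum(counts))

# Baseline: a uniform model over the 4 words has perplexity 4
uniform_perplexity <- exp(-sum(counts * log(1 / ncol(counts))) / sum(counts))

cat("log-likelihood:", log_lik, "\n")
cat("perplexity:", perplexity, "\n")
cat("uniform baseline:", uniform_perplexity, "\n")
```

Comparing against the uniform baseline gives a quick sanity check: a useful model should have a perplexity below the vocabulary size.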

7.4 R bokeh plot :

8 Final Analysis

What did you learn about the data? How did you answer the questions? How can you justify your answers?

9 Team members

Name                        Email-Id                          Matr. No.
Calida Pereira              calida.pereira@st.ovgu.de         229945
Chandan Radhakrishna        chandan.radhakrishna@st.ovgu.de   229746
Nandish Bandi Subbarayappa  nandish.bandi@st.ovgu.de          229591
Mohit Jaripatke             mohit.jaripatke@st.ovgu.de        224651
Priyanka Bhargava           priyanka.bhargava@st.ovgu.de      229675
